{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Fine-tune Falcon 180B with DeepSpeed ZeRO, LoRA & Flash Attention\n",
    "\n",
    "Falcon 180B is the newest version of Falcon LLM family. It is the biggest open source model with 180B parameter and trained on more data - 3.5T tokens with context length window upto 4K tokens. In this example we will show how to fine-tune Falcon 180B using DeepSpeed, Hugging Face Transformers, LoRA with Flash Attention on a multi-GPU machine.\n",
    "\n",
    "In detail you will learn how to:\n",
    "1. Setup Development Environment\n",
    "2. Load and prepare the dataset\n",
    "3. Fine-Tune Falcon 180B using DeepSpeed, Hugging Face Transformers, LoRA with Flash Attention\n",
    "\n",
    "Before we get into the code lets take a quick look on the technologies and methods we are going to use: \n",
    "\n",
    "### What is DeepSpeed ZeRO?\n",
    "\n",
    "DeepSpeed ZeRO focuses on efficient large-scale training of Transformers. ZeRO, or Zero Redundancy Optimizer, reduces memory footprint by partitioning model states across devices instead of basic data parallelism. This saves significant memory - ZeRO-Infinity can reduce usage 100x vs data parallelism. ZeRO-Offload further reduces memory by offloading parts of model and optimizer to CPU, enabling 10B+ parameter models on 1 GPU. ZeRO [integrates with HuggingFace Transformers through a configuration file](https://huggingface.co/docs/transformers/v4.26.1/en/main_classes/deepspeed).\n",
    "\n",
    "### What is LoRA?\n",
    "\n",
    "[LoRA](https://arxiv.org/abs/2106.09685) enables efficient fine-tuning of large language models. It decomposes weight matrices into smaller, trainable update matrices that adapt while keeping original weights frozen. This drastically reduces trainable parameters for faster, lower-memory tuning. LoRA integrates into [Transformers via Hugging Face's PEFT](https://huggingface.co/docs/peft/conceptual_guides/lora). It combines well with methods like DeepSpeed. Key advantages are efficient tuning, portable models, and no inference latency when merging trained weights. LoRA allows adaptively training massive models with limited resources.\n",
    "\n",
    "### What is Flash Attention?\n",
    "\n",
    "Flash Attention is an algorithm that speeds up the core attention mechanism in Transformer language models by restructuring computations. It uses techniques like tiling and recomputation to reduce the high memory costs of attention, enabling models to process longer text sequences. Flash Attention 2 optimizes parallelism and work partitioning for 2x speedup over the previous version, reaching 230 TFLOPS/s on A100 GPUs.\n",
    "\n",
    "\n",
    "### Access Falcon 180B \n",
    "\n",
    "Before we can start training we have to make sure that we accepted the license [tiiuae/falcon-180B](https://huggingface.co/tiiuae/falcon-180B) to be able to use it. You can accept the license by clicking on the Agree and access repository button on the model page at: \n",
    "* [tiiuae/falcon-180B](https://huggingface.co/tiiuae/falcon-180B)\n",
    "\n",
    "> The example was created and run a DGX A100 8-GPU machine with 80GB GPU memory per GPU.\n",
    "\n",
    "## 1. Setup Development Environment\n",
    "\n",
    "conda create --name hf python=3.10 -c conda-forge\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# install torch with the correct cuda version, check nvcc --version\n",
    "!pip install torch --extra-index-url https://download.pytorch.org/whl/cu118 --upgrade\n",
    "# install Hugging Face Libraries and additional dependencies\n",
    "!pip install \"transformers==4.34.0\" \"datasets==2.14.5\" \"accelerate==0.22.0\" \"evaluate==0.4.0\" \"peft==0.5.0\" tensorboard packaging --upgrade\n",
    "# install deepspeed and ninja for jit compilations of kernels\n",
    "!pip install \"deepspeed==0.10.3\" ninja --upgrade\n",
    "# install additional Flash Attention\n",
    "!pip install flash-attn --no-build-isolation --upgrade"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To access any Falcon 180B asset we need to login into our hugging face account. We can do this by running the following command:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!huggingface-cli login --token YOUR_TOKEN"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. Load and prepare the dataset\n",
    "\n",
    "we will use the [dolly](https://huggingface.co/datasets/databricks/databricks-dolly-15k) an open source dataset of instruction-following records generated by thousands of Databricks employees in several of the behavioral categories outlined in the [InstructGPT paper](https://arxiv.org/abs/2203.02155), including brainstorming, classification, closed QA, generation, information extraction, open QA, and summarization.\n",
    "\n",
    "```python\n",
    "{\n",
    "  \"instruction\": \"What is world of warcraft\",\n",
    "  \"context\": \"\",\n",
    "  \"response\": \"World of warcraft is a massive online multi player role playing game. It was released in 2004 by bizarre entertainment\"\n",
    "}\n",
    "```\n",
    "\n",
    "To load the `samsum` dataset, we use the `load_dataset()` method from the 🤗 Datasets library."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from datasets import load_dataset\n",
    "from random import randrange\n",
    "\n",
    "# Load dataset from the hub\n",
    "dataset = load_dataset(\"databricks/databricks-dolly-15k\", split=\"train\")\n",
    "\n",
    "print(f\"dataset size: {len(dataset)}\")\n",
    "print(dataset[randrange(len(dataset))])\n",
    "# dataset size: 15011"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To instruct tune our model we need to convert our structured examples into a collection of tasks described via instructions. We define a `formatting_function` that takes a sample and returns a string with our format instruction."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def format_dolly(sample):\n",
    "    instruction = f\"### Instruction\\n{sample['instruction']}\"\n",
    "    context = f\"### Context\\n{sample['context']}\" if len(sample[\"context\"]) > 0 else None\n",
    "    response = f\"### Answer\\n{sample['response']}\"\n",
    "    # join all the parts together\n",
    "    prompt = \"\\n\\n\".join([i for i in [instruction, context, response] if i is not None])\n",
    "    return prompt\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "lets test our formatting function on a random example."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from random import randrange\n",
    "\n",
    "print(format_dolly(dataset[randrange(len(dataset))]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In addition, to formatting our samples we also want to pack multiple samples to one sequence to have a more efficient training."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from transformers import AutoTokenizer\n",
    "\n",
    "model_id = \"tiiuae/falcon-180B\" \n",
    "tokenizer = AutoTokenizer.from_pretrained(model_id)\n",
    "tokenizer.pad_token = tokenizer.eos_token"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We define some helper functions to pack our samples into sequences of a given length and then tokenize them."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from random import randint\n",
    "from itertools import chain\n",
    "from functools import partial\n",
    "\n",
    "\n",
    "# template dataset to add prompt to each sample\n",
    "def template_dataset(sample):\n",
    "    sample[\"text\"] = f\"{format_dolly(sample)}{tokenizer.eos_token}\"\n",
    "    return sample\n",
    "\n",
    "\n",
    "# apply prompt template per sample\n",
    "dataset = dataset.map(template_dataset, remove_columns=list(dataset.features))\n",
    "# print random sample\n",
    "print(dataset[randint(0, len(dataset))][\"text\"])\n",
    "\n",
    "# empty list to save remainder from batches to use in next batch\n",
    "remainder = {\"input_ids\": [], \"attention_mask\": [], \"token_type_ids\": []}\n",
    "\n",
    "def chunk(sample, chunk_length=2048):\n",
    "    # define global remainder variable to save remainder from batches to use in next batch\n",
    "    global remainder\n",
    "    # Concatenate all texts and add remainder from previous batch\n",
    "    concatenated_examples = {k: list(chain(*sample[k])) for k in sample.keys()}\n",
    "    concatenated_examples = {k: remainder[k] + concatenated_examples[k] for k in concatenated_examples.keys()}\n",
    "    # get total number of tokens for batch\n",
    "    batch_total_length = len(concatenated_examples[list(sample.keys())[0]])\n",
    "\n",
    "    # get max number of chunks for batch\n",
    "    if batch_total_length >= chunk_length:\n",
    "        batch_chunk_length = (batch_total_length // chunk_length) * chunk_length\n",
    "\n",
    "    # Split by chunks of max_len.\n",
    "    result = {\n",
    "        k: [t[i : i + chunk_length] for i in range(0, batch_chunk_length, chunk_length)]\n",
    "        for k, t in concatenated_examples.items()\n",
    "    }\n",
    "    # add remainder to global variable for next batch\n",
    "    remainder = {k: concatenated_examples[k][batch_chunk_length:] for k in concatenated_examples.keys()}\n",
    "    # prepare labels\n",
    "    result[\"labels\"] = result[\"input_ids\"].copy()\n",
    "    return result\n",
    "\n",
    "\n",
    "# tokenize and chunk dataset\n",
    "lm_dataset = dataset.map(\n",
    "    lambda sample: tokenizer(sample[\"text\"]), batched=True, remove_columns=list(dataset.features)\n",
    ").map(\n",
    "    partial(chunk, chunk_length=2048),\n",
    "    batched=True,\n",
    ")\n",
    "\n",
    "# Print total number of samples\n",
    "print(f\"Total number of samples: {len(lm_dataset)}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "After we processed the datasets we want to save it to disk to be able to use the processed dataset later during training."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "lm_dataset.save_to_disk(\"dolly-processed\")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. Fine-Tune Falcon 180B using DeepSpeed, Hugging Face Transformers, LoRA with Flash Attention\n",
    "\n",
    "DeepSpeed ZeRO is natively integrated into the [Hugging Face Transformers Trainer](https://huggingface.co/docs/transformers/v4.33.1/en/main_classes/deepspeed). The integration enables leveraging ZeRO by simply providing a DeepSpeed config file, and the Trainer takes care of the rest. We created 2 deepspeed configurations for the experiments we ran, including `CPU offloading`: \n",
    "\n",
    "- [ds_falcon_180b_z3.json](./configs/ds_falcon_180b_z3.json)\n",
    "- [ds_falcon_180b_z3_offload.json](./configs/ds_falcon_180b_z3_offload.json)\n",
    "\n",
    "As mentioned in the beginning, we ran those example using a 8x NVIDIA A100 80GB. This means we can leverage `bf16`, which reduces the memory footprint of the model by almost ~2x, which allows us to train without offloading efficiently. We are going to use the [ds_falcon_180b_z3.json](./configs/ds_falcon_180b_z3.json). If you are irritated by the `auto` values, check the [documentation](https://huggingface.co/docs/transformers/v4.26.1/en/main_classes/deepspeed#configuration).\n",
    "\n",
    "In addition to the deepspeed configuration we also need a training script, which implements LoRA and patches our model to use flash-attention. We created a [run_ds_lora.py](./run_ds_lora.py) script, which patches the falcon model using the [falcon_patch.py](./utils/falcon_patch.py) utils and implements LoRA using [peft_utils.py](./utils/peft_utils.py). \n",
    "\n",
    "> When you run make sure that you have the same folder structure and utils/configs available. The easiest way is to clone the whole repository. Go into the `training` directory and start the training.\n",
    "\n",
    "Once we made sure that we have the right configuration and training script we can start the training using `torchrun`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!torchrun --nproc_per_node 8 run_ds_lora.py \\\n",
    "  --model_id tiiuae/falcon-180B \\\n",
    "  --dataset_path dolly-processed \\\n",
    "  --output_dir falcon-180b-lora-fa \\\n",
    "  --num_train_epochs 3 \\\n",
    "  --per_device_train_batch_size 1 \\\n",
    "  --learning_rate 4e-3 \\\n",
    "  --gradient_checkpointing True \\\n",
    "  --gradient_accumulation_steps 8 \\\n",
    "  --bf16 True \\\n",
    "  --tf32 True \\\n",
    "  --use_flash_attn True \\\n",
    "  --lr_scheduler_type \"constant_with_warmup\" \\\n",
    "  --logging_steps 25 \\\n",
    "  --save_steps 100 \\\n",
    "  --save_total_limit 3 \\\n",
    "  --deepspeed configs/ds_falcon_180b_z3.json"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "_Note: Since we are using LoRA we are only saving the \"trained\" adapter weights, to save some storage. If you want to merge the adapters back into the base model and save the merged model you can add `--merge_adapters True` or use the [merge_adapter_weights.py](./scripts/merge_adapter_weights.py) script._\n",
    "\n",
    "In our example for Falcon 180B, the training time was `153 minutes` or ~2 hours for 3 epochs. For comparison the pretraining cost of Falcon 180B was ~7,000,000 GPU hours, which is 3,500,000 time more than fine-tuning.\n",
    "\n",
    "## Conclusion \n",
    "\n",
    "In the blog post you learn how to fine-tune  Falcon 180B model using DeepSpeed, Hugging Face Transformers, and LoRA with Flash Attention on a multi-GPU machine. We used: \n",
    "\n",
    "* DeepSpeed ZeRO for memory optimization, enabling training models with up to trillions of parameters on limited GPU memory. We used stage 3 (ZeRO-Infinity) to optimize memory usage.\n",
    "* Hugging Face Transformers and Datasets for easily loading and preparing the text dataset as well as providing an intuitive Trainer API.\n",
    "* LoRA, a method to efficiently fine-tune large language models by only updating a small percentage of parameters each iteration. This drastically reduces memory usage and computational costs.\n",
    "* Flash Attention - a highly optimized attention implementation that further reduces the memory footprint.\n",
    "\n",
    "Compining all of those methods allows us to fine-tune LLMs with over 100B+ parameter with limited resources. The example provides a template for efficiently tuning the largest publicly available models."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "pytorch",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.12"
  },
  "orig_nbformat": 4,
  "vscode": {
   "interpreter": {
    "hash": "2d58e898dde0263bc564c6968b04150abacfd33eed9b19aaa8e45c040360e146"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
